This notebook SNIP is a brief use case showing how centroids from previous malware analysis can be used to build a classifier for new malware. Our full dataset is described here: https://github.com/action-ai-institute/MABEL-dataset/tree/main. You can retrieve a cleaned version ready for modeling here: https://github.com/solomonsonya/Artificial_Intelligence_Research/blob/main/malware_classification/nlp/archive/nlp_MABEL_dataset/standardized_dataset/_malware_family_master_dataset.7z
This notebook relies on the standardized_import_functions attribute provided in the dataset.

This notebook is built by Solomon Sonya

imports

Load MABEL Dataset SNIP

Vectorize all Malware Variant Centroids for Classification

Process the Data

Classify an Instance!

View Distance Matrix (this is argmin, i.e., the smallest distance is the centroid of greatest similarity)

View Similarities Matrix (this is argmax, i.e., the greatest percentage is the centroid of greatest similarity)

Different Classification Function

Interpreting the dataframes above.

Euclidean distance shows how far each centroid is from the test case. A larger number means the cluster is more dissimilar to the test case. A smaller number (approaching 0.0) indicates the cluster is very close (i.e., very similar) to the test case. A Euclidean distance of exactly 0.0 indicates the test case instance and the centroid are equivalent (i.e., identical). We select the smallest distance value here to identify the centroid that is most similar to the test case.
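
As a rough illustration of the Euclidean rule, here is a minimal sketch using NumPy. The family labels, centroid values, and test vector are made up for the example and do not come from the MABEL dataset; the notebook's own classification function may differ.

```python
import numpy as np

# Hypothetical centroids: one row per malware family (labels and values are made up).
family_labels = ["family_a", "family_b", "family_c"]
centroids = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
])

# Hypothetical test case; it coincides exactly with the family_b centroid.
test_vector = np.array([3.0, 4.0])

# Euclidean distance from the test case to every centroid (one value per row).
distances = np.linalg.norm(centroids - test_vector, axis=1)

# argmin: the smallest distance identifies the most similar centroid.
nearest_index = int(np.argmin(distances))
predicted_family = family_labels[nearest_index]

print(distances)         # distance to each centroid
print(predicted_family)  # label of the nearest centroid
```

Because the test vector is identical to the second centroid, its distance is 0.0, which is the "equivalent" case described above.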

Regarding cosine similarity, a larger value (approaching 1.0) indicates the centroid is more similar to the test case. A value approaching 0.0 indicates the test case is dissimilar to the centroid. We select the largest similarity value here to identify the centroid that is most similar to the test case.
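
The cosine rule can be sketched the same way. This is only an illustration under assumptions: the helper `cosine_similarities`, the labels, and the vectors are hypothetical, and the sketch assumes no vector has zero length (which would cause a division by zero).

```python
import numpy as np

# Hypothetical centroids: one row per malware family (labels and values are made up).
family_labels = ["family_a", "family_b", "family_c"]
centroids = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

# Hypothetical test case pointing in the same direction as the family_a centroid.
test_vector = np.array([2.0, 0.0])


def cosine_similarities(vector, matrix):
    """Cosine similarity between `vector` and each row of `matrix`.

    Hypothetical helper for illustration; assumes no zero-length vectors.
    """
    dot_products = matrix @ vector
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return dot_products / norms


similarities = cosine_similarities(test_vector, centroids)

# argmax: the largest similarity identifies the most similar centroid.
best_index = int(np.argmax(similarities))
predicted_family = family_labels[best_index]

print(similarities)      # similarity to each centroid
print(predicted_family)  # label of the most similar centroid
```

Note that the test vector is twice as long as the first centroid yet still scores 1.0 against it: cosine similarity compares direction only, whereas Euclidean distance is also sensitive to magnitude.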

Happy Hunting!

-Solomon Sonya